A Text Clustering Framework for Information Retrieval
Authors
Abstract
Text-mining methods have become a key component of homeland-security technologies, as they can help to effectively explore growing masses of digital documents in the search for relevant information. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The overall paradigm is hybrid because it combines pattern-recognition grouping algorithms with semantic-driven processing. First, a semantic-based metric measures distances between documents by combining content-based and behavioral analysis. This metric takes into account the lexical properties, structure, and style of the processed documents. Second, the model relies on a Radial Basis Function (RBF) kernel-based mapping to cluster the documents. The main novelty of the proposed approach is that it exploits the implicit mapping of RBF kernel functions to tackle the crucial task of normalizing similarities while embedding semantic information in the whole mechanism. Experimental results on the Reuters and 20 Newsgroups collections validate the proposed approach.
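The two-step idea in the abstract (a combined document distance, followed by clustering in an RBF kernel space) can be illustrated with a minimal sketch. The abstract does not specify the exact metric or grouping algorithm, so everything below is an assumption made for illustration: the weighted blend controlled by `alpha`, the bandwidth `sigma`, the choice of kernel k-means as the grouping step, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def combined_distance(d_content, d_style, alpha=0.5):
    """Blend two precomputed distance matrices (hypothetical weighting scheme)."""
    return alpha * d_content + (1.0 - alpha) * d_style

def rbf_kernel_from_distance(dist, sigma=1.0):
    """Map pairwise distances to RBF similarities in (0, 1]; the diagonal is 1."""
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def kernel_kmeans(K, n_clusters, init_labels=None, n_iter=50, seed=0):
    """Kernel k-means on a precomputed kernel matrix K (assumed grouping step)."""
    n = K.shape[0]
    if init_labels is None:
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    else:
        labels = np.asarray(init_labels).copy()
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)  # empty clusters stay at inf
        for c in range(n_clusters):
            mask = labels == c
            size = mask.sum()
            if size == 0:
                continue
            # Squared feature-space distance to the cluster mean:
            # K_ii - (2/|c|) * sum_j K_ij + (1/|c|^2) * sum_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy example: documents 0-1 are close to each other, as are documents 2-3.
D = np.array([[0.0, 0.1, 2.0, 2.0],
              [0.1, 0.0, 2.0, 2.0],
              [2.0, 2.0, 0.0, 0.1],
              [2.0, 2.0, 0.1, 0.0]])
K = rbf_kernel_from_distance(combined_distance(D, D), sigma=1.0)
labels = kernel_kmeans(K, n_clusters=2, init_labels=[0, 1, 1, 1])
```

Because every RBF similarity lies in (0, 1] and each document has self-similarity 1, the kernel acts as a built-in normalization of the raw distances. Kernel k-means is sensitive to its starting assignment, so in practice several initializations would likely be tried.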
Similar resources
A Method of Cluster-Based Indexing of Textual Data
This paper presents a framework for clustering in text-based information retrieval systems. The prominent feature of the proposed method is that documents, terms, and other related elements of textual information are clustered simultaneously into small overlapping clusters. In the paper, the mathematical formulation and implementation of the clustering method are briefly introduced, together wi...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Challenging Issues and Similarity Measures for Web Document Clustering
The Web contains a large number of documents available in electronic form. These documents come in various forms, and the information in them is not organized. This lack of organization of materials on the WWW motivates people to manage the huge amount of information automatically. Text mining refers generally to the process of extracting interesting and non-trivial information a...
A Study of the Role of Different Types of Homograph Context in Determining Similarity Between Documents
Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation, thereby improving the effectiveness of the retrieved results. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided into different types, wit...
Model Formulation: A Document Clustering and Ranking System for Exploring MEDLINE Citations
Objective: A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides a poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different to...
A Dynamic and Semantically-Aware Technique for Document Clustering in Biomedical Literature
As an unsupervised learning process, document clustering has been used to improve information retrieval performance by grouping similar documents and to help text mining approaches by providing a high-quality input for them. In this paper, the authors propose a novel hybrid clustering technique that incorporates semantic smoothing of document models into a neural network framework. Recently, it...